1 Introduction

The ptairMS package provides a workflow to process PTR-TOF-MS raw data in the open Hierarchical Data Format 5 (HDF5; .h5 extension), and generate the peak table as an ExpressionSet object for subsequent data analysis with the many methods and packages available in R (see for instance (???)). Applications include the analysis of exhaled breath, cell culture headspace or ambient air. The package offers several features to check the raw data and tune the few processing parameters. It also enables to include new samples in a study without re-processing all the previous data, providing a convenient management for cohort studies (e.g. duration of the inclusion process, longitudinal studies, etc.).

Proton Transfer Reaction - Mass Spectrometry (PTR-MS) has emerged with excellent sensitivity and specificity for VOC analysis in a wide range of applications, including environment, food quality, and biology (Blake, Monks, and Ellis 2009). In the area of health and care, PTR-MS opens up unique opportunities for real-time analysis at the patient’s bedside.

2 Volatolomics

The characterization of volatile organic compounds (VOCs) emitted by living organisms is of major interest in medicine, food sciences, and ecology. As an example, thousands of VOCs have been identified in the exhaled breath, resulting from normal metabolism or pathological processes (Lacy Costello et al. 2014). The main advantage of breath analysis in medicine is that the sampling is non-invasive (Devillier et al. 2017). Methods based on mass spectrometry (MS) are the reference technologies for VOC analysis because of their sensitivity and large dynamic range.

3 The ptairMS processing workflow

The workflow consists of five steps:

  1. createPtrSet: A ptrSet object is generated by taking as input the name of the directory containing the raw files (in HDF5 format), possibly grouped into subfolders according to classes of samples

  2. detectPeak: peak detection and quantification are performed within each file and the ptrSet object is updated with the sample metadata, the peak list for each sample, and several quality metrics

  3. alignSamples: The peak lists are aligned between samples and an ExpressionSet object is returned, containing the table of peak intensities, the sample metadata, and the feature metadata (which can be accessed with the exprs, pData and fData methods from the Biobase package, respectively)

  4. imputing: Missing values in the table of intensities may be replaced by the integrated signal in the expected raw data region

  5. annotateVOC: Features may be annotated, based on the Human Breathomics Database (Kuo et al. 2020)

4 Datasets (ptairData package)

Two real PTR-TOF-MS raw data sets are available in the ptairData package.

  • exhaledAir: Exhaled air from two healthy individual: three acquisitions per individual (with two expirations per acquisition), on distinct days [6 files]

  • mycobacteria: cell culture headspace: two replicates from two species and one control (culture medium) [6 files]

Note: To limit the size of the data and speed up the analysis, the raw data are truncated in the mass dimension within the range \([20.4,21.6] \cup [50,150]\) for individuals, and \([20.4,21.6] \cup [56.4,90.6]\) for mycobacteria.

5 Hands on

library(ptairMS)
library(ptairData)

5.1 createPtrSet: Checking raw data and setting parameters

The analysis starts from a directory containing the raw data. For this example, we use the exhaledAir from the ptairData package:

dirRaw <- system.file("extdata/exhaledAir", package = "ptairData")

Before processing the data, there are two important steps:

  • calibrate the mass axis

  • determine the time periods of interest within the acquisitions (i.e., expirations or headspace analysis duration in our examples)

To perform these steps and check the quality of the raw data, we first use the createPtrSet function which, for each file:

  • reads the raw data

  • calibrates the mass axis with the mzCalibRef parameter (see the Calibrating the mass axis section)

  • determines the time limits of expirations or headspace duration on the Extraced Ion Chromatogram (EIC) at mzBreathTracer mass or on the Total Ion Chromatogram (TIC) if this parameter is NULL, with the fracMaxTIC parameter (see the Determine the time limits for expirations or headspace duration section)

  • builds a default sampleMetadata table with the file names, date and possibly the subdirectories if it exists

This function creates a unique ptrSet object for the whole study, which contains all necessary information about the files: calibration parameters, time limits, peak lists (as this step they are empty), and primary ion quantification.

Importantly, the ptrSet may be updated if new files are added or deleted from the directory, with the function updatePtrSet (see the Updating the ptrSet section).

To create a ptrSet object form the raw data directory:

exhaledPtrset <- createPtrSet(dir=dirRaw,
                     setName="exhaledPtrset",
                     mzCalibRef = c(21.022, 60.0525),
                     mzBreathTracer = NULL,
                     fracMaxTIC = 0.7,
                     saveDir = NULL )
exhaledPtrset
## ptrSet object : exhaledPtrset 
## directory: C:/Users/CR258086/Documents/R/R-4.0.3/library/ptairData/extdata/exhaledAir 
##     6 files contains in the directory 
##     6 files check 
##     0 files processed by detectPeak

To get the list of the file names in your directory (at the last time the ptrSet was created or updated), use getFileNames:

getFileNames(exhaledPtrset)
## [1] "ind1-1.h5" "ind1-2.h5" "ind1-3.h5" "ind2-1.h5" "ind2-2.h5" "ind2-3.h5"

To check the quality and view useful information about your files, you can use the plot method on the ptrSet, that provides four plots:

  • calibError: the calibration error boxplot in ppm

  • \(\frac{m}{\Delta m}\): an estimation of the resolution for the mzCalibRef peaks

  • the average normalized shape of the mzCalibRef peaks

  • The reaction parameters and primary ion isotope intensity in cps in function of acquisition date

plot(exhaledPtrset)

We will now quickly explain each step of this function, and show how to choose and check the quality of the mzCalibRef, mzBreathTracer and fracMaxTIC parameters.

5.1.1 Calibrating the mass axis

To convert Time Of Flight (TOF) axis to mass-to-charge ratio (mz), the following formula is used (Müller et al. 2013):

  • \(mz = \Big (\frac{TOF-b}{a} \Big)^2\)

To estimate the parameters (a,b), reference peaks with known masses (and without overlapping peaks) are needed. For optimal results, the calibration peaks have to be well distributed over the entire m/z axis. The masses of the calibration peaks are set with the mzCalibRef parameter, in an numeric vector. For exhaled air, we propose by default 6 peaks (two of them are suggested by the IONICON manufacturer):

calib_table<-read.csv(system.file("extdata", "reference_tables/calib_table.tsv", package = "ptairMS"),sep="\t")
knitr::kable(calib_table)
Formula mz Compound
H3 (18O)+ 21.02204 Isotope of primary ion (H3O+)
N2H+ 29.01342 Diazote
C3H5+ 41.03858 Allyl cation
C3H7O+ 60.05250 Isotope of Acetone
C6H6I+ 203.94300 Iodobenzene (1,3-diiodobenzene fragment)
C6H5I2+ 330.84950 1,3-diiodobenzene (instrument internal standard)

Other reference masses (at least two) can be provided by the user.

If a mass contains one or several outlier values on the summary plot (i.e. an error > 20 ppm) for specific files, you can check those files with the function plotCalib that plots the Total Ion Spectrum (TIS) around all the mzCalibRef (with the exact masses as red vertical lines).

plotCalib(exhaledPtrset,fileNames=getFileNames(exhaledPtrset)[5])

You can then choose to keep or delete these masses from mzCalibRef. To change the calibration masses, use the calibration function on the ptrSet object.

In our dataset example, since we have truncated the mass axis, only two masses from those listed above are present. To calibrate with three masses, we therefore add the peak 75.04406 ([C3H6O2+H]+, Hydroxyacetone).

exhaledPtrset <- calibration(exhaledPtrset, mzCalibRef =  c(21.022, 60.0525,75.04406))
plot(exhaledPtrset,type="calibError")

5.1.2 Determine the time limits of expirations or headspace duration

To determine the expirations or headspace duration we use the timeLimits function. We can apply this function on a ptrSet, or a ptrRaw object, obtain thanks to the readRaw function (see section Processing a single raw file).

The limits are determined on the Total Ion Chromatogram (TIC) if mzBreathTracer is NULL, or in the Extracted Ion Chromatogram (EIC) around mzBreathTracer mass. They correspond to the part of the TIC/EIC where intensity is higher than fracMaxTIC * max(TIC), after baseline removal if baseline is TRUE.

Note: To analyze the entire spectrum (e.g. in ambient air studies), set fracMaxTIC = 0.

Example of expiration detection at fracMaxTIC = 0.5 on a single file:

Example of expiration detection at fracMaxTIC = 0.9:

To see and change time limits, use the changeTimeLimits function that opens the Shiny application for interactive visualization and selection.

exhaledPtrset <- changeTimeLimits(exhaledPtrset)

Note: You can also see the expiration limits on all or several files from the ptrSet with the plotTIC function. By default, when fileNames is NULL, all TICs files are plotted. A pdf file may be generated by setting the pdfFile argument to the absolute file path ending with the .pdf extension. Finally, you can remove the baseline by setting baselineRm = TRUE, and add the time limits to the plot by setting showLimits = TRUE. Coloring the TICs according to a column from the sampleMetadata is also possible by indicating the column name as the colorBy parameter.

plotTIC(object = exhaledPtrset, showLimits = TRUE, pdfFile="AllTIC.pdf")
plotTIC(object = exhaledPtrset, showLimits = TRUE,fileNames=getFileNames(exhaledPtrset)[1],type = "ggplot")

To change the fracMaxTIC parameter for all the files in the ptrSet, use the timeLimits function on the ptrSet:

exhaledPtrset<-timeLimits(exhaledPtrset,fracMaxTIC = 0.8)

5.1.3 Managing sample metadata

The createPtrSet function automatically generates a default sampleMetadata data.frame. It contains the file names as row names,a column named ‘subfolder’ when the files are organized into subfolders in the parent directory, and a column date with the date and hour of the acquisition. To get this data frame, use the getSampleMetadata method. Remember that the row names of the sampleMetadata must always correspond to all the file names from the directory.

getSampleMetadata(exhaledPtrset)
##           subfolder                date
## ind1-1.h5      ind1 21/02/2019 11:52:49
## ind1-2.h5      ind1 27/02/2019 11:41:24
## ind1-3.h5      ind1 28/02/2019 11:28:29
## ind2-1.h5      ind2 07/02/2019 10:29:33
## ind2-2.h5      ind2 18/02/2019 11:06:52
## ind2-3.h5      ind2 22/02/2019 09:52:20

You can at any moment obtain this default sample metadata with the function resetSampleMetadata(exhaledPtrset).

To modify the sample metadata, there exists two different ways:

  • setSampleMetadata method: modify the data frame in your R session and set it back into the ptrSet object.
sampleMD <- getSampleMetadata(exhaledPtrset)
colnames(sampleMD)[1] <- "individual"  

exhaledPtrset <- setSampleMetadata(exhaledPtrset,sampleMD)
getSampleMetadata(exhaledPtrset)
##           individual                date
## ind1-1.h5       ind1 21/02/2019 11:52:49
## ind1-2.h5       ind1 27/02/2019 11:41:24
## ind1-3.h5       ind1 28/02/2019 11:28:29
## ind2-1.h5       ind2 07/02/2019 10:29:33
## ind2-2.h5       ind2 18/02/2019 11:06:52
## ind2-3.h5       ind2 22/02/2019 09:52:20
  • exportSampleMetada and importSampleMetadata methods: exportSampleMetada saves the data.frame in the .tsv format in the directory of your choice. The parameter saveFile must always have the .tsv extension. You can open and modify it on any spreadsheet editor, but row names must never be modified: if one of the files happens to be missing from the row names during the subsequent import, an error will be thrown. Once the .tsv file has been modified, it can be imported back into the ptrSet object, with the function importSampleMetadata.
exportSampleMetada(exhaledPtrset, saveFile = file.path(DirBacteria,"sampleMetadata.tsv"))
exhaledPtrset <- importSampleMetadata(exhaledPtrset, file = file.path(DirBacteria,"sampleMetadata.tsv"))

5.1.4 Saving

When calling the createPtrSet function, you may use the saveDir argument to save the ptrSet object in the directory of your choice, with setName parameter as name (the .RData extension will automatically be added at the end of the file name). Subsequent import of the saved ptrSet object relies on the classical load function.

5.1.5 Plot raw data

There exist two functions for plotting the raw data:

  • plotRaw: Displays for a file the image of the matrix of intensities, the Extracted Ion Chromatogram (EIC), the Extracted Ion Spectrum (EIS), for the selected m/z and time ranges, and all the putative annotations available in the vocDB within this mz range (no peak detection has yet been done as this step)
plotRaw(exhaledPtrset, mzRange = 59 , fileNames = getFileNames(exhaledPtrset)[1],showVocDB = TRUE)

##    ion_mass ion_formula                           name_iupac
## 40 59.08552  [C4H10+H]+               butane|2-methylpropane
## 39 59.06037 [C2H6N2+H]+                      dimethyldiazene
## 38 59.04914  [C3H6O+H]+ prop-2-en-1-ol|propanal|propan-2-one
## 37 58.94261     [Ni+H]+                               nickel
  • plotFeatures: for all selected files, the average spectrum for all time periods is plotted, with the background superimposed as a dashed line. As with the plotTIC function, you can choose the type of display (plotly or ggplot), and the possibility to color spectra individually by indicating a column name of the sample Metadata
plotFeatures(exhaledPtrset,mz=21.022)
plotFeatures(exhaledPtrset,mz=59.049,type="ggplot",colorBy = "individual")

5.2 updatePtrSet: Updating the ptrSet

If you delete or add files to the directory after the ptrSet object has been created, the updatePtrSet function must be run.

exhaledPtrset <- updatePtrSet(exhaledPtrset)
## nothing to update

5.3 detectPeak: Peak detection and quantification

Now since we have checked that the calibration is efficient, and that the expiration or headspace time limits are correct, the peaks can be detected and quantified in each file with the detectPeak function, which works on the ptrSet object (and the corresponding raw files).

For each file in the directory, this function:

  • make a calibration of the mass axis every 30 seconde

  • detects peak on the total ion spectrum with an un-targeted peak picking algorithm

  • estimate the temporal evolution of each peak detected with a 2-dimensional model, wrote in the peakListRaw slot of the ptrSet object

  • quantify each peak in both exhaled air and background, in cps, ncps (normalized by the primary ion, H3(O18)+ * 488) and ppb [(Hansel et al. 1995)] units, wrote in peakListAligned slot of the ptrSet object

  • make two unilateral t-test between intensity of background and expiration/headspace (if there is any), and write the p-value in the peakListAligned slot

The two peakList are then written in the ptrSet object as a list of data.tables containing for each file all the peaks detected.

exhaledPtrset <- detectPeak(exhaledPtrset)
## ind1-1.h5 : 146 peaks detected 
## ind1-2.h5 : 161 peaks detected 
## ind1-3.h5 : 137 peaks detected 
## ind2-1.h5 : 151 peaks detected 
## ind2-2.h5 : 162 peaks detected 
## ind2-3.h5 : 157 peaks detected

If you want to restrict the detection to specific nominal masses, indicate them with the mzNominal argument: for example exhaledPtrset <- detectPeak(exhaledPtrset , mzNominal = c(5,60)).

The peak detection step may take a few minutes if there are many files and a large m/z range. You may parallelize the algorithm by setting parallelize = TRUE and giving the number of available cores of your computer in nbCores. To find the number of CPU cores available in your computer, use parallel::detectCores().

To see the resulting peak lists, use the getPeakList method. It returns a list of two elements:

  • raw: contains the peak list obtained for each expiration and background, i.e. a data.table with the following columns:

    • Mz: the mass center of the peak
    • lower and upper: the boundaris of the peak
    • overlap: 0 or 1 if there is an overlapping peak
    • quanti_cps - time: the quantification in cps at each time points
  • aligned: contains the peak list obtained after baseline removal, average in expiration and background, and the p-value of the t-test:

    • Mz: the mass center of the peak
    • quanti_unit: the mean of the quantification over the expiration/headspace time limits defined
    • background_unit: the mean of the quantification over the background time limits defined
    • diffAbs_unit: the mean of the quantification over the expiration/headspace time limits defined after subtracting the baseline estimated from the background points defined
    • pValLess/ pValGreater: the p-value of the unilateral t-test, who test that quantidication (in cps) of expiration points are less/greater than the intenisty of the background.
peakList<-getPeakList(exhaledPtrset)
head(peakList$aligned$`ind1-1.h5`)
##          Mz quanti_cps background_cps diffAbs_cps  quanti_ncps background_ncps
## 1: 21.02168  3632.5605      4283.6469   526.50584 1.775792e-03    2.094078e-03
## 2: 51.04351  5613.3892       121.7130  5352.69510 2.744128e-03    5.949986e-05
## 3: 53.00123   633.6348      1357.1559   834.76524 3.097549e-04    6.634510e-04
## 4: 53.03821   680.8930       631.1108   148.83399 3.328573e-04    3.085210e-04
## 5: 54.00821   218.5719       274.4212    20.03098 1.068497e-04    1.341519e-04
## 6: 54.03377   105.1643       254.4373   162.59445 5.140998e-05    1.243827e-04
##    diffAbs_ncps  quanti_ppb background_ppb diffAbs_ppb  pValGreater
## 1: 2.573845e-04 42.47278462    49.75262842 6.115128118 1.000000e+00
## 2: 2.616686e-03  2.91218241     0.06272414 2.758483674 1.667320e-20
## 3: 4.080783e-04  0.31524321     0.67071955 0.412549047 1.000000e+00
## 4: 7.275809e-05  0.33875494     0.31190106 0.073555198 2.547632e-01
## 5: 9.792224e-06  0.10655786     0.13289647 0.009700585 1.000000e+00
## 6: 7.948494e-05  0.05126955     0.12321865 0.078741094 1.000000e+00
##        pValLess
## 1: 1.433692e-07
## 2: 1.000000e+00
## 3: 1.353706e-14
## 4: 1.000000e+00
## 5: 4.006003e-02
## 6: 1.029143e-07

You can see the temporal evolution of each peak as follow:

mz60<-as.numeric(peakList$raw$`ind1-1.h5`[which.min(abs(Mz-60.05)),-c(1,2,3,4)])
plot(mz60)

A R Shinny application for visualization of modeled peaks and temporal estimation is in progress.

5.4 Updating the ptrSet peak lists with detectPeak

As described previously, an important feature of the ptairMS package is the possibility to update the ptrSet object linked to a directory, as new data files are added and included in this directory. In such a case, we have seen that the updatePtrSet function must be used to reset the sample metadata or append it. Then, the detectPeak function must be used on the updated ptrSet to compute the additional peak lists.

exhaledPtrset<-updatePtrSet(exhaledPtrset)
exhaledPtrset<-setSampleMetadata(exhaledPtrset,resetSampleMetadata(exhaledPtrset))
exhaledPtrset<-detectPeak(exhaledPtrset)

5.5 alignSamples: Aligning features between samples

The alignment between samples (i.e. the matching of variables between the peak lists within the ptrSet object) is performed by using a kernel gaussian density (Delabrière et al. 2017).

The alignSamples returns an ExpressionSet object, with the table of peak intensities which has just been built, the sample meta data (borrowed from the input ptrSet) and the variable meta data which contains peak intensities in the background.

It is possible to apply two filters:

  • On the background: only variables which that the pvalue of the t-test that intenisty are greater in expiration than background is greater than pValGreaterThres for fracExp % of the samples

  • On sample reproducibility: only variables which are detected in more than fracGroup percent of samples ( or group) are kept.

If you do not want to apply those filters, set fracGroup to 0 and pValGreaterThres to 1.

exhaledEset <- alignSamples(exhaledPtrset, group="individual", fracGroup = 1, pValGreaterThres = 1e-10, fracExp=1/6)
## 58 peaks aligned

The three tables from the ExpressionSet can be accessed with the classical exprs, pData, and fData accessors:

knitr::kable(head(Biobase::exprs(exhaledEset)))
ind1-1.h5 ind1-2.h5 ind1-3.h5 ind2-1.h5 ind2-2.h5 ind2-3.h5
51.0438 2.7584837 2.8017963 2.420703 8.8436299 6.7742669 11.0667678
52.0471 NA NA NA 0.1326104 0.1108963 0.1568612
53.0384 0.0735552 0.0216730 0.161987 0.4419335 0.4184362 0.3533443
55.039 42.6579729 29.0224228 32.876570 50.3174579 50.7659769 38.7360065
55.0534 0.5433896 0.4824238 0.191002 1.3209861 0.3628197 0.9762208
57.0327 1.3524605 1.3244174 1.322496 1.6224182 5.3902188 1.3535606
knitr::kable(head(Biobase::pData(exhaledEset)))
individual date
ind1-1.h5 ind1 21/02/2019 11:52:49
ind1-2.h5 ind1 27/02/2019 11:41:24
ind1-3.h5 ind1 28/02/2019 11:28:29
ind2-1.h5 ind2 07/02/2019 10:29:33
ind2-2.h5 ind2 18/02/2019 11:06:52
ind2-3.h5 ind2 22/02/2019 09:52:20
knitr::kable(head(Biobase::fData(exhaledEset)))
ion_mass Background - ind1-1.h5 Background - ind1-2.h5 Background - ind1-3.h5 Background - ind2-1.h5 Background - ind2-2.h5 Background - ind2-3.h5
51.0438 51.0438 0.0627241 0.0513964 0.0195223 0.0197875 0.0156695 0.0207212
52.0471 52.0471 NA NA NA 0.0078028 0.0010303 0.0021661
53.0384 53.0384 0.3119011 0.3143121 0.1778464 0.1749327 0.2928767 0.3383823
55.039 55.0390 0.7088711 0.5859148 0.6632503 1.1428881 0.2010073 0.4702364
55.0534 55.0534 2.9867440 3.9925226 2.7491979 2.4186937 4.9778561 3.3858188
57.0327 57.0327 1.0369904 1.2822502 0.7324881 0.8874536 1.0084791 1.0761186

Note: The view method from the ropls package may be used to print these tables:

ropls::view(exhaledEset, plotL = FALSE)
## 'exprs(x)':
##     dim  class    mode typeof size NAs     min mean median     max
##  58 x 6 matrix numeric double 0 Mb  12 5.9e-05   25   0.33 1.4e+03
##                   ind1-1.h5          ind1-2.h5 ...          ind2-2.h5
## 51.0438    2.75848367381112   2.80179625826249 ...   6.77426686815163
## 52.0471                <NA>               <NA> ...  0.110896347130384
## ...                     ...                ... ...                ...
## 141.0564 0.0771073794503359 0.0497287521818851 ... 0.0529753509355601
## 149.1155  0.193335382065498  0.136658656744426 ...   1.03683708474274
##                   ind2-3.h5
## 51.0438    11.0667677768043
## 52.0471   0.156861226919561
## ...                     ...
## 141.0564 0.0695380945157266
## 149.1155   2.35141933992167
## 'pData(x)':
##  individual      date
##   character character
##  nRow nCol size NAs
##     6    2 0 Mb   0
##           individual                date
## ind1-1.h5       ind1 21/02/2019 11:52:49
## ind1-2.h5       ind1 27/02/2019 11:41:24
## ...              ...                 ...
## ind2-2.h5       ind2 18/02/2019 11:06:52
## ind2-3.h5       ind2 22/02/2019 09:52:20
## 'fData(x)':
##  ion_mass Background - ind1-1.h5 ... Background - ind2-2.h5
##   numeric                numeric ...                numeric
##  Background - ind2-3.h5
##                 numeric
##  nRow nCol size NAs
##    58    7 0 Mb  12
##          ion_mass Background - ind1-1.h5 ... Background - ind2-2.h5
## 51.0438   51.0438     0.0627241413580302 ...      0.015669543137708
## 52.0471   52.0471                   <NA> ...     0.0010303385651669
## ...           ...                    ... ...                    ...
## 141.0564 141.0564     0.0308447234075375 ...   -0.00134929324091076
## 149.1155 149.1155      0.103722775310896 ...      0.148619347244205
##          Background - ind2-3.h5
## 51.0438      0.0207212498097142
## 52.0471     0.00216609082207191
## ...                         ...
## 141.0564     0.0486852034642205
## 149.1155      0.162922957125842

5.6 imputing: Imputation of missing values

To impute missing values, ptairMS returns back to the raw data, and performs the quantification with the same method as detectPeak but this time without any limit of detection.

exhaledEset <- ptairMS::impute(exhaledEset,  exhaledPtrset)
## ind1-1.h5 done
## ind1-2.h5 done
## ind1-3.h5 done
## ind2-1.h5 done
## ind2-2.h5 done
## ind2-3.h5 done

5.7 annotateVOC: Annotation

ptairMS provides putative annotations by matching the measured ion masses to the Human Breathomics Database (Kuo et al. 2020). Applied to an ExpressionSet, it appends the feature metadata (fData) with new columns containing chemical information (formulas, IUPAC name, InChI, etc.), as well as the isotopes for nuclides C13, O17 and O18.

annotateVOC(59.049)
##        vocDB_ion_mass vocDB_ion_formula                       vocDB_name_iupac
## 59.049       59.04914        [C3H6O+H]+ prop-2-en-1-ol, propan-2-one, propanal
exhaledEset<-annotateVOC(exhaledEset)
knitr::kable(head(Biobase::fData(exhaledEset)))
ion_mass Background - ind1-1.h5 Background - ind1-2.h5 Background - ind1-3.h5 Background - ind2-1.h5 Background - ind2-2.h5 Background - ind2-3.h5 vocDB_ion_mass vocDB_ion_formula vocDB_name_iupac isotope
51.0438 51.0438 0.0627241414 0.0513963845 0.0195223387 0.0197875483 0.0156695431 0.0207212498 NA
52.0471 52.0471 NA NA NA 0.0078027569 0.0010303386 0.0021660908 NA
53.0384 53.0384 0.3119010562 0.3143120573 0.1778463612 0.1749327425 0.2928767284 0.3383822642 53.03857 [C4H4+H]+ but-1-en-3-yne NA
55.039 55.0390 0.7088710922 0.5859147707 0.6632503333 1.1428881020 0.2010073052 0.4702364344 NA
55.0534 55.0534 2.9867439508 3.9925226001 2.7491978600 2.4186937005 4.9778561330 3.3858188324 55.05422 [C4H6+H]+ but-1-yne, but-2-yne, buta-1,2-diene, buta-1,3-diene NA
57.0327 57.0327 1.0369903572 1.2822501597 0.7324881472 0.8874536045 1.0084791207 1.0761185780 57.03349 [C3H4O+H]+ prop-2-enal NA

5.8 writeEset: Export data and metadata to 3 tabular files

Finally, the ExpressionSet can be exported to 3 tabulated files ‘dataMatrix.tsv’, sampleMetadata.tsv’, and ‘variableMetadata.tsv’:

writeEset(exhaledEset, dirC = file.path(getwd(), "processed_dataset"))

5.9 Statistical analysis

The ExpressionSet object is now ready for subsequent data analysis (e.g. data mining, classification or feature selection) with the many R packages.

As an example, we describe here how to perform Exploratory Data Analysis with the ropls package (???).

As a preliminary step, log transformation is often used in Mass Spectrometry to stabilize the variance:

Biobase::exprs(exhaledEset) <- log2(Biobase::exprs(exhaledEset))

Then, the data and metadata may be printed and plotted with the view method:

ropls::view(Biobase::exprs(exhaledEset),printL=FALSE)

To avoid redundancy, the isotopes may be discarded:

isotopes<-Biobase::fData(exhaledEset)[,"isotope"]
isotopes<-isotopes[!is.na(isotopes)]
exhaledEset <- exhaledEset[!(Biobase::fData(exhaledEset)[, "ion_mass"] %in% isotopes), ]

Principal Component Analysis may then be performed (due to the limited number of samples, cross-validation is decreased to 5 instead of 7):

exhaledPca<-ropls::opls(exhaledEset,crossvalI=5,info.txtC="none",fig.pdfC="none")
ropls::plot(exhaledPca, parAsColFcVn=Biobase::pData(exhaledEset)[, "individual"],typeVc="x-score")

Subsequent supervised analysis, e.g. (Orthogonal) Partial Least Squares, or feature selection may be performed with the ropls and biosigner packages, respectively.

To get the most important variable that contribute to the dimension that discriminate the two individual (here the first dimension), we look at the loadings:

load1<-ropls::getLoadingMN(exhaledPca)[,1]
barplot(sort(abs(load1),decreasing = TRUE))

knitr::kable(Biobase::fData(exhaledEset)[names(sort(abs(load1),decreasing = TRUE)),c("vocDB_ion_formula","vocDB_name_iupac")])
vocDB_ion_formula vocDB_name_iupac
52.0471
135.114 [C10H14+H]+ 1-ethyl-2,3-dimethylbenzene, 1-ethyl-2,4-dimethylbenzene, 1-ethyl-3,5-dimethylbenzene, 1-methyl-2-propan-2-ylbenzene, 1-methyl-2-propylbenzene, 1-methyl-3-propan-2-ylbenzene, 1-methyl-4-propan-2-ylbenzene, 1-methyl-4-propylbenzene, 1,2,3,4-tetramethylbenzene, 1,2,3,5-tetramethylbenzene, 1,2,4,5-tetramethylbenzene, 1,3-diethylbenzene, 1,4-diethylbenzene, 2-ethyl-1,4-dimethylbenzene, 2-methyl-5-prop-1-en-2-ylcyclohexa-1,3-diene, 4-ethyl-1,2-dimethylbenzene, butylbenzene
67.0535 [C5H6+H]+ (E)-pent-3-en-1-yne, 2-methylbut-1-en-3-yne, cyclopenta-1,3-diene, pent-3-en-1-yne
110.9724
55.039
69.0697 [C5H8+H]+ (3E)-penta-1,3-diene, (3Z)-penta-1,3-diene, 2-methylbuta-1,3-diene, 3-methylbuta-1,2-diene, pent-2-yne, penta-1,2-diene, penta-1,4-diene
80.0492 [C5H5N+H]+ pyridine
53.0384 [C4H4+H]+ but-1-en-3-yne
131.0979
87.0797 [C5H10O+H]+ 2-methylbutanal, 3-methylbutan-2-one, 3-methylbutanal, pentan-2-one, pentanal
57.0327 [C3H4O+H]+ prop-2-enal
59.0496 [C3H6O+H]+ prop-2-en-1-ol, propan-2-one, propanal
123.0415 [C7H6O2+H]+ hydron;benzoate
137.1307 [C10H16+H]+ (3E)-3,7-dimethylocta-1,3,6-triene, (3E,5E)-3,7-dimethylocta-1,3,5-triene, 1-methyl-4-prop-1-en-2-ylcyclohexene, 1-methyl-4-propan-2-ylcyclohexa-1,3-diene, 1-methyl-4-propan-2-ylcyclohexa-1,4-diene, 1-methyl-4-propan-2-ylidenecyclohexene, 2-methyl-5-propan-2-ylcyclohexa-1,3-diene, 2,2-dimethyl-3-methylidenebicyclo[2.2.1]heptane, 2,6,6-trimethylbicyclo[3.1.1]hept-2-ene, 3-methylidene-6-propan-2-ylcyclohexene, 3,7,7-trimethylbicyclo[4.1.0]hept-3-ene, 4-methylidene-1-propan-2-ylcyclohexene, 6,6-dimethyl-2-methylidenebicyclo[3.1.1]heptane, 7-methyl-3-methylideneocta-1,6-diene
81.069 [C6H8+H]+ (3Z)-hexa-1,3,5-triene, (Z)-3-methylpent-3-en-1-yne, 1-methylcyclopenta-1,3-diene, 3-methylidenecyclopentene, 4-methylidenecyclopentene, 5-methylcyclopenta-1,3-diene, cyclohexa-1,3-diene, cyclohexa-1,4-diene
51.0438
63.0435
60.0522
115.0743 [C6H10O2+H]+ (E)-4-methylpent-2-enoic acid, 2-methylprop-2-enyl acetate, hexane-2,5-dione, propyl prop-2-enoate
149.1155 [C7H16O3+H]+, [C8H17Cl+H]+ 1-(1-methoxypropan-2-yloxy)propan-2-ol, 1-(2-methoxypropoxy)propan-2-ol, 2-chlorooctane
126.9685
65.0591
89.0585 [C4H8O2+H]+ 1,4-dioxane, 2-methylpropanoic acid, 3-hydroxybutanal, butanoic acid, ethyl acetate, methyl propanoate, propyl formate
95.0847 [C7H10+H]+ 1-methylcyclohexa-1,4-diene, 1,2-dimethylcyclopenta-1,3-diene, 3-methylidenecyclohexene, 5,5-dimethylcyclopenta-1,3-diene, cyclohepta-1,3-diene, hepta-1,3,5-triene
71.0478 [C4H6O+H]+ (E)-but-2-enal, 2-methylprop-2-enal, 2,3-dihydrofuran, 2,5-dihydrofuran, but-3-en-2-one
58.0405
79.0382
85.0635 [C5H8O+H]+ (E)-2-methylbut-2-enal, (E)-pent-3-en-2-one, 3-methylbut-2-enal, 3-methylbut-3-en-2-one, 3,4-dihydro-2H-pyran, cyclopentanone, pent-1-en-3-one, pentanal
81.0259
109.0673 [C7H8O+H]+ 2-prop-2-enylfuran, 3-methylphenol, 4-methylphenol, phenylmethanol
63.0072
95.048 [C6H6O+H]+ 2-ethenylfuran, hydron;phenoxide
89.0417 [C4H8S+H]+ 1-methylsulfanylprop-1-ene, 3-methylsulfanylprop-1-ene
91.0727 [C4H10O2+H]+ 1-methoxypropan-2-ol, 2-ethoxyethanol, butane-2,3-diol
75.0434 [C3H6O2+H]+ 1-hydroxypropan-2-one, 1,3-dioxolane, 2-methoxyacetaldehyde, ethyl formate, methyl acetate, propanoic acid
63.0259 [C2H6S+H]+ ethanethiol, methylsulfanylmethane
103.0745 [C5H10O2+H]+ 3-methylbutanoic acid, ethyl propanoate, pentanoic acid, propan-2-yl acetate, propyl acetate
74.0598 [C3H7NO+H]+ N,N-dimethylformamide, propanamide
77.0589 [C3H8O2+H]+ 2-methoxyethanol, dimethoxymethane, propane-1,2-diol
79.074
65.0202
55.0534 [C4H6+H]+ but-1-yne, but-2-yne, buta-1,2-diene, buta-1,3-diene
141.0564
93.0349
107.0844 [C8H10+H]+ 1,2-xylene, 1,3-xylene, 1,4-xylene, ethylbenzene
64.0295
117.0897 [C6H12O2+H]+ 2-methylpropyl acetate, 4-methylpentanoic acid, butyl acetate, ethyl butanoate, hexanoic acid, propyl propanoate
88.0772 [C4H9NO+H]+ 2-methylpropanamide
133.1013 [C10H12+H]+ [(E)-but-1-enyl]benzene, [(E)-but-2-enyl]benzene, 1-ethenyl-2,4-dimethylbenzene, 1-ethenyl-3-ethylbenzene, 1-ethenyl-3,5-dimethylbenzene, 1-methyl-4-prop-1-en-2-ylbenzene, 2-ethenyl-1,3-dimethylbenzene, 2-methylprop-1-enylbenzene, 4-ethenyl-1,2-dimethylbenzene, but-1-en-2-ylbenzene

We then could plot the raw data around this masses to check the robustness of these potential marker:

plotFeatures(exhaledPtrset,mz = 53.0387,typePlot = "ggplot",colorBy = "individual")

plotFeatures(exhaledPtrset,mz = 67.0539,typePlot = "ggplot",colorBy = "individual")

plotFeatures(exhaledPtrset,mz = 135.1149,typePlot = "ggplot",colorBy = "individual")

plotFeatures(exhaledPtrset,mz = 137.1315,typePlot = "ggplot",colorBy = "individual")

5.10 Processing a single raw file

To read and access the entire raw data of a single file, use the readRaw function. It opens the data from an absolute file path with the rhdf5 library, and returns a ptrRaw object containing the raw data matrix, the time axis, the m/z axis obtained after calibration if calibTIS = TRUE, and additional information contained in the h5 file (transmission curve, drift temperature, …).

As an example, we get the absolute file path for the first raw file:

samplePath<-list.files(dirRaw,recursive = TRUE,full.names = TRUE,pattern = ".h5$")[1]

To read the file:

sampleRaw <- readRaw(samplePath, calibTIS = FALSE)
sampleRaw
## ind1-1.h5 
##   mz range:  20.4 - 150.7 
##   time range:  0 - 49 
##   Calibration error in ppm: 
##      No calibration performs

At this stage, the same methods as with the ptrSet object can be used: calibration, timeLimits, plotCalib, plotTIC, plotRaw, and detectPeak.

6 Acknowledgements

This research was conducted by Camille Roquencourt, with the contributions from Paul Zheng, Pierrick Roger-Mele and Etienne A. Thevenot (Data Sciences for Deep Phenotyping and Precision Medicine team at CEA), in collaboration with Stanislas Grassin-Delyle, Helene Salvator, Emmanuel Naline, Philippe Devillier and Louis-Jean Couderc (Exhalomics platform at the Foch Hospital) within the SoftwAiR project funded by the Agence Nationale de la Recherche (ANR-18-CE45-0017 grant).

7 Annex

7.1 Peak detection algorithm

For each nominal mass m/z:

  • Selection of the m/z range (Müller et al. 2013): [m/z - 0.6, m/z + 0.6] and definition of m/z windows for noise \([m/z - 0.6, m/z - 0.4] \cup [m/z + 0.4, m/z + 0.6]\) and signal \([m/z - 0.4, m/z + 0.4]\)

  • Removing baseline (???)

  • Smoothing the signal with Savitzky Golay windows (Savitzky and Golay 1964) computed by using noise autocorrelation (Vivó-Truyols and Schoenmakers 2006)

  • Detection of peak maxima by using the derivatives of Savitzky Golay filter (all concave regions are determined with the second derivative, and the maxima correspond in each of those regions to the point with the first derivative closest to zero)

  • Thresholding based on the maximum noise value around the nominal mass.

  • Iterative fit of the sech2 function (???): \(peak_i(h_i,m_i,\Delta_{1i},\Delta_{2i}) = \frac{h_i}{cosh(\Delta_{1i} \times (x-m_i))^2 \times (x \leq m_i)+cosh(\Delta_{2i} \times (x-m_i))^2 \times (x>m_i)}\), with initialization and the following constraints:

    • \(m\): mass of the peak maxima detected \(\pm [\Delta/4]\)
    • \(\Delta_1\)= \(\Delta_2\) = \(m\)/5000 (the 5,000 value been chosen as the mean resolution of the instrument)
    • \(h\): intensity at the peak apex
  • Quantification by summing the fitted peaks over a range of \(10 \times \Delta_i\)

peakList<-PeakList(sampleRaw, mzNominal = c(63), 
                                plotFinal=TRUE ,plotAll =TRUE)

peakList$peak
##            Mz  quanti_cps      delta_mz  resolution    parameter.1
## 1 63.00495668 1985.856969 0.01211830260 5199.156907 0.005249331919
## 2 63.02435438 3048.749568 0.01111504749 5670.183094 0.005463106500
## 3 63.04161258 1416.440355 0.02246965880 2805.632838 0.016224352426
##      parameter.2  parameter.3           R2 overlap
## 1 0.006868970682 259.44937755 0.9990058167       1
## 2 0.005651940991 434.23359852 0.9990058167       1
## 3 0.006245306372  99.81826902 0.9990058167       1

8 Session Info

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
## [5] LC_TIME=French_France.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ptairData_0.1.0   ptairMS_1.0.2     data.table_1.13.2 BiocStyle_2.18.1 
## 
## loaded via a namespace (and not attached):
##   [1] bit64_4.0.5         doParallel_1.0.16   RColorBrewer_1.1-2 
##   [4] httr_1.4.2          tools_4.0.3         backports_1.2.0    
##   [7] R6_2.5.0            enviPat_2.4         rpart_4.1-15       
##  [10] BiocGenerics_0.36.0 Hmisc_4.4-2         lazyeval_0.2.2     
##  [13] colorspace_2.0-0    nnet_7.3-14         rhdf5filters_1.2.0 
##  [16] tidyselect_1.1.0    gridExtra_2.3       bit_4.0.4          
##  [19] curl_4.3            compiler_4.0.3      chron_2.3-56       
##  [22] Biobase_2.50.0      htmlTable_2.1.0     plotly_4.9.2.1     
##  [25] labeling_0.4.2      bookdown_0.21       checkmate_2.0.0    
##  [28] scales_1.1.1        stringr_1.4.0       digest_0.6.27      
##  [31] foreign_0.8-80      rmarkdown_2.5       rio_0.5.16         
##  [34] base64enc_0.1-3     jpeg_0.1-8.1        pkgconfig_2.0.3    
##  [37] htmltools_0.5.0     highr_0.8           htmlwidgets_1.5.2  
##  [40] rlang_0.4.9         readxl_1.3.1        rstudioapi_0.13    
##  [43] farver_2.0.3        generics_0.1.0      jsonlite_1.7.2     
##  [46] crosstalk_1.1.0.1   ropls_1.22.0        dplyr_1.0.2        
##  [49] zip_2.1.1           car_3.0-10          magrittr_2.0.1     
##  [52] Formula_1.2-4       Matrix_1.2-18       Rcpp_1.0.5         
##  [55] munsell_0.5.0       Rhdf5lib_1.12.0     abind_1.4-5        
##  [58] lifecycle_0.2.0     stringi_1.5.3       yaml_2.2.1         
##  [61] carData_3.0-4       MASS_7.3-53         rhdf5_2.34.0       
##  [64] grid_4.0.3          parallel_4.0.3      forcats_0.5.0      
##  [67] crayon_1.3.4        lattice_0.20-41     haven_2.3.1        
##  [70] cowplot_1.1.0       splines_4.0.3       hms_0.5.3          
##  [73] magick_2.5.2        knitr_1.30          pillar_1.4.7       
##  [76] ggpubr_0.4.0        ggsignif_0.6.0      codetools_0.2-18   
##  [79] glue_1.4.2          evaluate_0.14       latticeExtra_0.6-29
##  [82] BiocManager_1.30.10 vctrs_0.3.5         png_0.1-7          
##  [85] foreach_1.5.1       cellranger_1.1.0    gtable_0.3.0       
##  [88] purrr_0.3.4         tidyr_1.1.2         ggplot2_3.3.2      
##  [91] xfun_0.19           openxlsx_4.2.3      broom_0.7.2        
##  [94] rstatix_0.6.0       survival_3.2-7      viridisLite_0.3.0  
##  [97] signal_0.7-6        minpack.lm_1.2-1    tibble_3.0.4       
## [100] iterators_1.0.13    cluster_2.1.0       ellipsis_0.3.1

Bibliography

Blake, Robert S., Paul S. Monks, and Andrew M. Ellis. 2009. “Proton-Transfer Reaction Mass Spectrometry.” Chemical Reviews 109 (3): 861–96. https://doi.org/10.1021/cr800364q.

Delabrière, Alexis, Ulli M Hohenester, Benoit Colsch, Christophe Junot, François Fenaille, and Etienne A Thévenot. 2017. “proFIA: A Data Preprocessing Workflow for Flow Injection Analysis Coupled to High-Resolution Mass Spectrometry.” Edited by Oliver Stegle. Bioinformatics 33 (23): 3767–75. https://doi.org/10.1093/bioinformatics/btx458.

Devillier, Philippe, Helene Salvator, Emmanuel Naline, Louis-Jean Couderc, and Stanislas Grassin-Delyle. 2017. “Metabolomics in the Diagnosis and Pharmacotherapy of Lung Diseases.” Current Pharmaceutical Design 23 (14). https://doi.org/10.2174/1381612823666170130155627.

Hansel, A., A. Jordan, R. Holzinger, P. Prazeller, W. Vogel, and W. Lindinger. 1995. “Proton Transfer Reaction Mass Spectrometry: On-Line Trace Gas Analysis at the Ppb Level.” International Journal of Mass Spectrometry and Ion Processes 149-150 (November): 609–19. https://doi.org/10.1016/0168-1176(95)04294-U.

Kuo, Tien-Chueh, Cheng-En Tan, San-Yuan Wang, Olivia A Lin, Bo-Han Su, Ming-Tsung Hsu, Jessica Lin, et al. 2020. “Human Breathomics Database” 2020: 8.

Lacy Costello, B de, A Amann, H Al-Kateb, C Flynn, W Filipiak, T Khalid, D Osborne, and N M Ratcliffe. 2014. “A Review of the Volatiles from the Healthy Human Body.” Journal of Breath Research 8 (1): 014001. https://doi.org/10.1088/1752-7155/8/1/014001.

Müller, Markus, Tomáš Mikoviny, Werner Jud, Barbara D’Anna, and Armin Wisthaler. 2013. “A New Software Tool for the Analysis of High Resolution PTR-TOF Mass Spectra.” Chemometrics and Intelligent Laboratory Systems 127 (August): 158–65. https://doi.org/10.1016/j.chemolab.2013.06.011.

Savitzky, Abraham., and M. J. E. Golay. 1964. “Smoothing and Differentiation of Data by Simplified Least Squares Procedures.” Analytical Chemistry 36 (8): 1627–39. https://doi.org/10.1021/ac60214a047.

Vivó-Truyols, Gabriel, and Peter J. Schoenmakers. 2006. “Automatic Selection of Optimal Savitzky−Golay Smoothing.” Analytical Chemistry 78 (13): 4598–4608. https://doi.org/10.1021/ac0600196.